timothy leffel, spring 2017

welcome!

agenda for course:

  • point 1
  • point 2
  • point …

course materials are all on the course website:

http://lefft.xyz/r_minicourse

each week we'll have slides, notes, and a script. little exercises will be interleaved throughout the notes. the best way to write up solutions is to start a new R script called (e.g.) week1_exercises.r and type directly into that.

there will also be a list of links to useful resources up on the site

how to talk to R: via command-line interface

how to talk to R: via default R GUI

how to talk to R: via the RStudio IDE

navigating RStudio

2. Variables and Assignments

time to start writing code!

# welcome to the R mini-course. in keeping with tradition...
print("...an obligatory 'hello, world!'")
## [1] "...an obligatory 'hello, world!'"

# this line is a comment, so R will always ignore it.
# this is a comment too, since it also starts with "#".

# but the next one is a line of real R code, which does some arithmetic:
5 * 3
## [1] 15
# we can do all kinds of familiar math operations:
5 * 3 + 1
## [1] 16
# 'member "PEMDAS"?? applies here too -- compare the last line to this one:
5 * (3 + 1)
## [1] 20

# usually when we do some math, we want to save the result for future use.
# we can do this by **assigning** the result of a computation to a **variable**
firstvar <- 5 * (3 + 1)
# now 'firstvar' is an **object**. we can see its value by printing it.
# sending `firstvar` to the interpreter is equivalent to `print(firstvar)`
firstvar
## [1] 20

# we can put basically anything into a variable, and we can call a variable
# pretty much whatever we want (but do avoid special characters besides "_")
myvar <- "boosh!"
myvar
## [1] "boosh!"

myVar <- 5.5
myVar
## [1] 5.5
# including other variables or computations involving them:
my_var <- myvar
my_var
## [1] "boosh!"

myvar0 <- myVar / (myVar * 1.5)
myvar0
## [1] 0.6666667

# when you introduce variables, they'll appear in the environment tab of the 
# top-right pane in RStudio. you can remove variables you're no longer
# using with `rm()`. (this isn't necessary, but it saves space in both 
# your brain and your computer's memory.)
rm(myvar)
rm(my_var)
rm(myVar)
rm(myvar0)
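
a quick way to confirm that `rm()` did its job is `exists()`, which checks whether a name is currently bound to a value. here's a minimal sketch (the name `tmpvar` is made up just for this illustration):

```r
tmpvar <- 5 * (3 + 1)   # hypothetical variable, only for this demo
exists("tmpvar")
## [1] TRUE
rm(tmpvar)
exists("tmpvar")
## [1] FALSE
```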

3. Vectors

# R was designed with statistical applications in mind, so naturally there's
# lots of ways to represent collections or sequences of values (e.g. numbers).

# in R, a "vector" is the simplest list-like data structure.
# you can create a vector with the `c()` function (for "concatenate")
myvec <- c(1, 2, 3, 4, 5)
myvec
## [1] 1 2 3 4 5
anothervec <- c(4.5, 4.12, 1.0, 7.99)
anothervec
## [1] 4.50 4.12 1.00 7.99

# a vector can hold values of any basic type (numbers, strings, ...), but all
# of its elements must be of the same type.
# to keep things straight in your head, maybe include the data type in the name
myvec_char <- c("a","b","c","d","e")
myvec_char
## [1] "a" "b" "c" "d" "e"
# if we try the following, R will coerce the numbers into characters:
myvec2 <- c("a","b","c",1,2,3)
myvec2
## [1] "a" "b" "c" "1" "2" "3"
rm(myvec2)
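
to check what type a vector ended up with, `class()` is handy. the sketch below uses made-up vectors (`nums`, `mixed`, `lgl_and_num` are illustration names only) to show what coercion does when types are mixed:

```r
nums <- c(1, 2, 3)
class(nums)
## [1] "numeric"

# mixing characters and numbers: everything becomes character
mixed <- c("a", 1)
class(mixed)
## [1] "character"

# mixing logicals and numbers: TRUE becomes 1 (and FALSE would become 0)
lgl_and_num <- c(TRUE, 1.5)
class(lgl_and_num)
## [1] "numeric"
lgl_and_num
## [1] 1.0 1.5
```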

suppose the only reason we created myvec and anothervec was to combine them with some other values and save the result as longvec. in that case, once longvec exists, we can remove myvec and anothervec and just work with longvec

# you can put vectors or values together with `c()`
longvec <- c(0, myvec, 9, 80, anothervec, 0, NA)
rm(myvec)
rm(anothervec)

longvec
##  [1]  0.00  1.00  2.00  3.00  4.00  5.00  9.00 80.00  4.50  4.12  1.00
## [12]  7.99  0.00    NA

# to see how many elements a vector has, get its `length()`
length(longvec)
## [1] 14
# to see what the unique values are, use `unique()` (you'll get a vector back)
unique(longvec)
##  [1]  0.00  1.00  2.00  3.00  4.00  5.00  9.00 80.00  4.50  4.12  7.99
## [12]    NA
# a very common operation is to see how many unique values there are:
length(unique(longvec))
## [1] 12

# to see a frequency table over a vector, use `table()`
table(longvec)
## longvec
##    0    1    2    3    4 4.12  4.5    5 7.99    9   80 
##    2    2    1    1    1    1    1    1    1    1    1
# note that this works for all kinds of vectors
table(c("a","b","c","b","b","b","a"))
## 
## a b c 
## 2 4 1
table(c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE))
## 
## FALSE  TRUE 
##     4     2

an important but not obvious thing:

R has a special value called NA, which represents missing data.

by default, table() won't tell you about NA's (annoying, i know!). so get in the habit of specifying the useNA argument of table()

# note "DRY" principle (don't repeat yourself)
vec_with_NA <- c(1,2,3,2,2,NA,3,NA,NA,1,1)
table(vec_with_NA)
## vec_with_NA
## 1 2 3 
## 3 3 2
table(vec_with_NA, useNA="ifany") # or "always" or "no"
## vec_with_NA
##    1    2    3 <NA> 
##    3    3    2    3
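
the difference between "ifany" and "always" only shows up when there are no NA's at all. a small sketch, using a made-up vector called `no_missing`:

```r
no_missing <- c(1, 2, 2)   # hypothetical vector containing no NA's

# "ifany": no NA's present, so no <NA> column is shown
tab_ifany <- table(no_missing, useNA="ifany")
tab_ifany
## no_missing
## 1 2 
## 1 2

# "always": the <NA> column is shown even though its count is 0
tab_always <- table(no_missing, useNA="always")
tab_always
## no_missing
##    1    2 <NA> 
##    1    2    0
```

using "always" can be the safer habit, since an explicit 0 tells you that the NA's were checked for and not just hidden.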

notice that the structure of the last table command is:

table(VECTOR, useNA=CHARACTERSTRING)

some terminology:

  • table() is a function

  • table() has argument positions for a vector and for a string

  • we provided table() with two arguments:

    • a vector (vec_with_NA)
    • a character string ("ifany")

  • the second argument was passed by name, as useNA

  • we used the argument binding syntax useNA="ifany"

argument-binding is kind of like variable assignment, but useNA doesn't become directly available for use after we give it a value.

this might feel kinda abstract, but i promise the intuition will become clearer the further along we get.

some arguments – like useNA here – can be thought of as "options" of the function they belong to.

# here's an example that might clarify the concept of argument binding:
round(3.141592653, digits=4)
## [1] 3.1416
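
to make the contrast with variable assignment concrete, here's a small sketch. the names `by_position` and `by_name` are made up for illustration, and `inherits=FALSE` just tells `exists()` to look only in the current workspace:

```r
# arguments can be matched by position...
by_position <- round(3.141592653, 4)
by_position
## [1] 3.1416

# ...or bound explicitly by name
by_name <- round(x=3.141592653, digits=4)
by_name
## [1] 3.1416

# binding `digits=4` inside the call did not create a variable called `digits`
exists("digits", inherits=FALSE)
## [1] FALSE
```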

round() is a commonly used function that illustrates an important concept called vectorization.

many functions in R are vectorized by default, which means that they can take an individual value (like the round() call above), or they can take a vector of values.

in the latter case, the function applies pointwise to each element of the vector, and returns a vector with the same length and order as the input:

round(longvec, digits=4)
##  [1]  0.00  1.00  2.00  3.00  4.00  5.00  9.00 80.00  4.50  4.12  1.00
## [12]  7.99  0.00    NA
# if we don't tell it how many digits to round to, it defaults to 0
round(longvec)
##  [1]  0  1  2  3  4  5  9 80  4  4  1  8  0 NA

in fact, most math functions and operators in R are vectorized
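
for instance, arithmetic operators and functions like `sqrt()` work element by element. a minimal sketch with a made-up vector `nums`:

```r
nums <- c(1, 4, 9, 16)   # hypothetical example vector

nums * 2                 # each element gets doubled
## [1]  2  8 18 32
sqrt(nums)               # square root of each element
## [1] 1 2 3 4
nums + c(10, 20, 30, 40) # two vectors of equal length: added pairwise
## [1] 11 24 39 56
```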

4. Subsetting and Indexing

# rep() and seq() and : and ...
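
a rough sketch of what the functions named above do (the particular values and variable names are made up for illustration, and this isn't meant to be exhaustive):

```r
one_to_five <- 1:5                       # `:` gives the integers from 1 to 5
one_to_five
## [1] 1 2 3 4 5

quarters <- seq(from=0, to=1, by=0.25)   # a sequence with a custom step size
quarters
## [1] 0.00 0.25 0.50 0.75 1.00

rep("a", times=3)                        # one value, repeated
## [1] "a" "a" "a"
rep(c(1, 2), times=2)                    # a whole vector, repeated
## [1] 1 2 1 2
```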


# subsetting vectors (introduce via letters + LETTERS)
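
`letters` and `LETTERS` are built-in vectors holding the lowercase and uppercase alphabet. here's a minimal sketch of subsetting with square brackets (`scores` is a made-up vector for the last example):

```r
letters[1:3]           # the first three elements
## [1] "a" "b" "c"
LETTERS[c(1, 26)]      # the first and the last element
## [1] "A" "Z"
letters[-(1:23)]       # negative indices drop elements
## [1] "x" "y" "z"

scores <- c(5, 12, 7)  # hypothetical numbers
scores[scores > 6]     # keep only the elements that pass a test
## [1] 12  7
```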

here's an analogy you should keep in mind: think of vectors as columns of an abstract spreadsheet (not rows).

in fact, this is a bit more than an analogy in R. R's implementation of a "spreadsheet" – the data frame – is quite literally a list of vectors. the data frame is a beautiful data structure, and is used to represent (flat) datasets e.g. from an excel sheet.
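
a quick sketch of the "list of vectors" idea, using a tiny made-up data frame called `toy_df`:

```r
toy_df <- data.frame(id=c(1, 2, 3), score=c(90, 85, 77))  # hypothetical data

is.list(toy_df)    # a data frame really is a list
## [1] TRUE
toy_df$score       # each column is an ordinary vector
## [1] 90 85 77
length(toy_df)     # its "length" as a list is the number of columns
## [1] 2
```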

fun fact: python's most popular data analysis library, pandas, borrows heavily from R, most clearly with its DataFrame, a very nice implementation of R's data frame structure.

we'll have a first look at data frames next

5. Data Frames!

a data frame is…

# there are several ways to create data frames, and here's one:
mydf <- data.frame(
  col1=c(1,2,3,4,5,6),
  col2=c("a","b","c","a","b","b"),
  stringsAsFactors=TRUE  # the default before R 4.0; in R >= 4.0, set it to make col2 a factor
)
mydf
##   col1 col2
## 1    1    a
## 2    2    b
## 3    3    c
## 4    4    a
## 5    5    b
## 6    6    b
# here's a handful of common functions you'll call on a data frame, in order
# to visually inspect it or to refer to some property it has:

dim(mydf)        # a vector of length 2: number of rows, number of cols
## [1] 6 2
nrow(mydf)       # number of rows
## [1] 6
ncol(mydf)       # number of columns
## [1] 2
str(mydf)        # the structure of the data frame
## 'data.frame':    6 obs. of  2 variables:
##  $ col1: num  1 2 3 4 5 6
##  $ col2: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 2
summary(mydf)    # gives useful info about each column
##       col1      col2 
##  Min.   :1.00   a:2  
##  1st Qu.:2.25   b:3  
##  Median :3.50   c:1  
##  Mean   :3.50        
##  3rd Qu.:4.75        
##  Max.   :6.00
names(mydf)      # the names of the columns
## [1] "col1" "col2"

6. Recap and list of functions to learn